rocmlir-tuning-driver improvements #2249

Open
mirza-halilcevic wants to merge 11 commits into develop from tuning-driver-improvements

mirza-halilcevic commented Feb 21, 2026

Motivation

Improvements to reduce memory usage and improve performance when tuning. Excessive memory usage causes problems on APU systems.

Technical Details

rocmlir-tuning-driver.cpp:

  • Avoid unnecessary iterations when greedy tuning falls back to exhaustive in the non-accel case
  • Create the stream and allocate the GPU buffers once, and initialize the buffers with memset instead of using extra host buffers

ConcurrentQueue.h:

  • Implement rate-adaptiveness so we don't accumulate more compile results than necessary
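
To make the backpressure idea concrete, here is a minimal sketch of a bounded queue whose capacity adapts to how fast the consumer drains it. It is not the code in ConcurrentQueue.h: the class name `AdaptiveBoundedQueue`, its methods, and the doubling/halving policy are all assumptions chosen for illustration. The capacity grows when the consumer finds the queue empty (it is being starved) and shrinks when the consumer finds it full (results are piling up).

```cpp
#include <algorithm>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <utility>

// Hypothetical sketch; names and policy are assumptions, not the PR's code.
template <typename T>
class AdaptiveBoundedQueue {
public:
  AdaptiveBoundedQueue(std::size_t minCap, std::size_t maxCap)
      : minCap_(minCap), maxCap_(maxCap), cap_(minCap) {
    assert(minCap >= 1 && minCap <= maxCap);
  }

  // Blocks while the queue holds at least `cap_` items (backpressure).
  void push(T value) {
    std::unique_lock<std::mutex> lock(mu_);
    notFull_.wait(lock, [&] { return items_.size() < cap_; });
    items_.push_back(std::move(value));
    notEmpty_.notify_one();
  }

  // Non-blocking push: returns false instead of waiting when full.
  bool tryPush(T value) {
    std::lock_guard<std::mutex> lock(mu_);
    if (items_.size() >= cap_)
      return false;
    items_.push_back(std::move(value));
    notEmpty_.notify_one();
    return true;
  }

  // Blocks until an item is available, adapting the capacity first.
  T pop() {
    std::unique_lock<std::mutex> lock(mu_);
    adapt();
    notEmpty_.wait(lock, [&] { return !items_.empty(); });
    return takeFront();
  }

  // Non-blocking pop: returns false when the queue is empty.
  bool tryPop(T &out) {
    std::lock_guard<std::mutex> lock(mu_);
    adapt();
    if (items_.empty())
      return false;
    out = takeFront();
    return true;
  }

  std::size_t capacity() const {
    std::lock_guard<std::mutex> lock(mu_);
    return cap_;
  }

private:
  // Called with the mutex held, before each pop attempt.
  void adapt() {
    if (items_.empty())
      cap_ = std::min(cap_ * 2, maxCap_); // consumer starved: allow more
    else if (items_.size() >= cap_)
      cap_ = std::max(cap_ / 2, minCap_); // producers ahead: allow less
  }

  // Called with the mutex held and a non-empty queue.
  T takeFront() {
    T value = std::move(items_.front());
    items_.pop_front();
    notFull_.notify_one();
    return value;
  }

  const std::size_t minCap_;
  const std::size_t maxCap_;
  std::size_t cap_;
  std::deque<T> items_;
  mutable std::mutex mu_;
  std::condition_variable notFull_;
  std::condition_variable notEmpty_;
};
```

In a setup like this, compile threads would call `push` and the benchmarking thread would call `pop`, so the number of buffered compile results stays close to what the consumer can actually use.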

Test Plan

Several problem configs that were previously crashing on Hawk Point are now able to run.

Submission Checklist

mirza-halilcevic marked this pull request as ready for review on February 23, 2026 12:18

Copilot AI left a comment

Pull request overview

This PR improves the rocmlir-tuning-driver to reduce memory usage and improve performance during tuning operations, particularly targeting issues on APU systems with limited memory.

Changes:

  • Eliminates host buffer allocation by initializing GPU buffers directly with hipMemsetAsync
  • Implements rate-adaptive concurrent queue to provide backpressure and prevent excessive memory accumulation
  • Caches and reuses thread resources (MLIR contexts, PassManagers) across greedy tuning iterations
  • Tracks effective tuning kind to avoid unnecessary iterations when greedy mode falls back to exhaustive
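
The effective-kind tracking in the last item can be sketched as follows. This is an illustration under stated assumptions, not the code from RockTuning.h: `ParamSetSketch`, `createSpace`, `plannedRounds`, and the placeholder candidate list are hypothetical names, and only the `effectiveKind` field name comes from the PR. The idea is that the driver consults the kind that was actually used rather than the kind that was requested, so a greedy request that fell back to exhaustive runs a single pass.

```cpp
#include <cassert>
#include <vector>

enum class TuningKind { Exhaustive, Greedy };

// Hypothetical stand-in for a tuning parameter set.
struct ParamSetSketch {
  TuningKind requestedKind; // what the caller asked for
  TuningKind effectiveKind; // what the space was actually built with
  std::vector<int> candidates;
};

// Builds a space and records the fallback when greedy is not applicable.
// Assumption: greedy is only available when `hasAccel` is true.
ParamSetSketch createSpace(TuningKind requested, bool hasAccel) {
  ParamSetSketch space{requested, requested, {}};
  if (requested == TuningKind::Greedy && !hasAccel)
    space.effectiveKind = TuningKind::Exhaustive;
  for (int i = 0; i < 4; ++i) // placeholder candidates
    space.candidates.push_back(i);
  return space;
}

// The driver decides how many rounds to run from the effective kind, so a
// fallback to exhaustive does not repeat the same full sweep several times.
int plannedRounds(const ParamSetSketch &space, int greedyRounds) {
  assert(greedyRounds >= 1);
  return space.effectiveKind == TuningKind::Greedy ? greedyRounds : 1;
}
```

With this shape, `plannedRounds(createSpace(TuningKind::Greedy, false), 3)` yields one round, because the space reports `Exhaustive` as its effective kind.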

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| mlir/tools/rocmlir-tuning-driver/rocmlir-tuning-driver.cpp | Refactored buffer management to use GPU-direct initialization, moved stream creation outside benchmarking loop, implemented thread resource caching, and added effectiveKind tracking |
| mlir/tools/rocmlir-tuning-driver/ConcurrentQueue.h | Added rate-adaptive queue with dynamic capacity adjustment to limit memory usage from compilation results |
| mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp | Set effectiveKind field when creating tuning space and when falling back from Greedy to Exhaustive |
| mlir/include/mlir/Dialect/Rock/Tuning/RockTuning.h | Added effectiveKind field to TuningParamSet struct to track actual tuning mode used |
